# Intro to Thompson Sampling

Thompson Sampling is a Bayesian algorithm for the multi-armed bandit problem, a classic setting in reinforcement learning. It balances exploring uncertain options against exploiting the best option found so far.

# Multi-Armed Bandit

This method can be used to solve the following problem. Imagine you are at a casino facing a row of slot machines, and you know that the machines have different chances of the player winning. You want to find the one with the highest win probability in the smallest number of rounds.

Let's assume that one round costs €1 and you have €10,000 to spend; in other words, you will play 10,000 rounds.

# Beta Function

Thompson Sampling uses the beta distribution. Its density is defined as follows (the normalizing constant $B$ is the beta function, which gives the distribution its name):

$$
\begin{align*}
&F(x) = \frac{f(x)}{B} \\
&\text{where} \\
&f(x) = x^{\alpha - 1}(1-x)^{\beta - 1} \\
&B = \int_0^1 f(x)\,dx \\
&\alpha, \beta > 0 \ \text{are constants}
\end{align*}
$$

You can notice that when $\alpha$ increases, the graph of $F(x)$ shifts to the right, and when $\beta$ increases, it shifts to the left.
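To see this shift numerically, here is a quick sketch using NumPy's beta sampler (the parameter values and seed are arbitrary choices, not from the original). The mean of $\text{Beta}(\alpha, \beta)$ is $\alpha / (\alpha + \beta)$, so raising $\alpha$ pulls samples toward 1 and raising $\beta$ pulls them toward 0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample means approximate alpha / (alpha + beta).
print(rng.beta(2, 5, size=100_000).mean())   # close to 2/7  ~ 0.286
print(rng.beta(20, 5, size=100_000).mean())  # close to 20/25 = 0.8
print(rng.beta(2, 50, size=100_000).mean())  # close to 2/52 ~ 0.038
```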

# Sampling

We can solve the multi-armed bandit problem by using the number of wins and the number of losses on each slot machine as that machine's $\alpha$ and $\beta$. Each round, we build a beta distribution per machine from those counts, draw one sample from each, and play the machine whose sample is highest. Machines that win often produce high samples and get played more, while uncertain machines still get occasional tries, so play concentrates on the optimal machine quickly.

# Code

First we create the environment and an array $X$ which will store the result of each round for every machine.

```python
import numpy as np

conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
N = 10000
d = len(conversionRates)

# X[i][j] == 1 means machine j would pay out in round i.
X = np.zeros((N, d))
for i in range(N):
    for j in range(d):
        if np.random.rand() < conversionRates[j]:
            X[i][j] = 1
```
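As a sanity check (this snippet is an addition to the original tutorial, with an assumed seed), the column means of the simulated outcomes should sit close to the configured conversion rates. The same environment can also be built in one vectorized step:

```python
import numpy as np

rng = np.random.default_rng(42)
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
N = 10000

# Compare N uniform draws per machine against its rate in one step.
X = (rng.random((N, len(conversionRates))) < conversionRates).astype(int)

# Each column holds Bernoulli outcomes for one machine, so its mean
# estimates that machine's conversion rate.
print(X.mean(axis=0))  # close to [0.15, 0.04, 0.13, 0.11, 0.05]
```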

Then we can use Thompson Sampling with the beta distribution to train our model. As wins and losses accumulate, each machine's distribution shifts left or right; the machine with the best conversion rate drifts to the right, so its samples tend to be the highest, and the model chooses it more and more often.

```python
nPosReward = np.zeros(d)  # wins per machine (alpha - 1)
nNegReward = np.zeros(d)  # losses per machine (beta - 1)

for i in range(N):
    selected = 0
    maxRandom = 0
    # Draw one sample per machine and keep the highest.
    for j in range(d):
        randomBeta = np.random.beta(nPosReward[j] + 1, nNegReward[j] + 1)
        if randomBeta > maxRandom:
            maxRandom = randomBeta
            selected = j

    # Update the chosen machine's win/loss counts.
    if X[i][selected] == 1:
        nPosReward[selected] += 1
    else:
        nNegReward[selected] += 1
```
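To verify that play really concentrates on the best machine, here is a self-contained rerun of the whole pipeline (the seed and the use of NumPy's `default_rng` are my assumptions, not part of the original) that reports how often each machine was played. The total wins plus losses per machine is exactly its selection count:

```python
import numpy as np

rng = np.random.default_rng(0)
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
N = 10000
d = len(conversionRates)

# Precompute the environment, as before.
X = (rng.random((N, d)) < conversionRates).astype(int)

nPosReward = np.zeros(d)
nNegReward = np.zeros(d)

for i in range(N):
    selected = 0
    maxRandom = 0
    for j in range(d):
        randomBeta = rng.beta(nPosReward[j] + 1, nNegReward[j] + 1)
        if randomBeta > maxRandom:
            maxRandom = randomBeta
            selected = j
    if X[i][selected] == 1:
        nPosReward[selected] += 1
    else:
        nNegReward[selected] += 1

nSelected = nPosReward + nNegReward
for j in range(d):
    print(f"Machine {j} was selected {int(nSelected[j])} times")
print("Most-played machine:", int(np.argmax(nSelected)))
```

With 10,000 rounds, the most-played machine is almost always the one with the 0.15 conversion rate.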